Dynamic Resource Management and Job Scheduling for High Performance Computing = Dynamisches Ressourcenmanagement und Job-Scheduling für das Hochleistungsrechnen

Author

  • Suraj Prabhakaran
Abstract

Job scheduling and resource management play an essential role in high-performance computing. Supercomputing resources are usually managed by a batch system, which is responsible for the effective mapping of jobs onto resources (i.e., compute nodes). From the system perspective, a batch system must ensure high system utilization and throughput, while from the user perspective it must ensure fast response times and fairness when allocating resources across jobs. Parallel jobs can be divided into four categories: rigid, moldable, malleable, and evolving. While rigid jobs have fixed resource requirements over their entire life cycle, moldable jobs allow batch systems to deviate from the requested number of resources before job start. In contrast, malleable and evolving jobs can adapt to changing resource allocations at runtime. While batch systems can expand or shrink a malleable job’s resource allocation at any point in time, an evolving job is expanded or shrunk only in response to a request made by the application itself. Traditional batch systems support only rigid and moldable jobs, that is, they perform static resource management. However, this is not sufficient as supercomputing enters a new era. Scientific applications are becoming much more complex and now often exhibit unpredictably changing resource requirements. Programming models are also becoming more adaptive in nature to support malleability for energy efficiency and fault tolerance. Therefore, scheduling evolving and malleable jobs (i.e., dynamic resource management) will be indispensable, especially on future large-scale systems. This dissertation therefore proposes novel dynamic resource management and scheduling techniques for cluster systems, making multiple contributions in the areas of dynamic resource (de)allocation mechanisms, efficient adaptive job scheduling, and resiliency. As the first contribution, this thesis presents dynamic scheduling methods for evolving jobs.
A fairness scheme is proposed to ensure the fair allocation of resources between static and dynamic resource requests. The evaluation with a workload containing both rigid and evolving jobs shows that high resource utilization and throughput can be achieved, while maintaining the fair dynamic assignment of resources. It is also demonstrated how these methods can be beneficially employed in heterogeneous architectures with network-attached accelerators. The second contribution presents a unique scheduling technique for malleable jobs and an algorithm for the combined scheduling of all four types of jobs in a cluster environment. We introduce the Dependency-based Expand/Shrink (DBES) algorithm, which rests on a two-phase malleable job expand/shrink strategy. The batch system is evaluated with a mixed workload and our strategy achieves consistently superior performance in comparison to state-of-the-art malleable job scheduling strategies. Finally, as the last contribution, we present a scheduling algorithm for dynamic node replacement, which improves the resiliency of cluster systems. The algorithm uses the unique features of the four job types and can provide replacement nodes instantly to jobs affected by node failures. Among current fault tolerance mechanisms, our technique causes the smallest loss of throughput.
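The four job categories described in the abstract differ in when an allocation may change and who initiates the change. The following is a minimal Python sketch of that taxonomy only, not the dissertation's scheduler or the DBES algorithm. The names `JobType`, `Job`, and `may_resize`, and the initiator labels `"batch_system"` and `"application"`, are hypothetical and chosen for illustration. It also assumes that only moldable jobs are adjusted before start, which the abstract does not state for malleable or evolving jobs.

```python
from dataclasses import dataclass
from enum import Enum


class JobType(Enum):
    RIGID = "rigid"          # fixed resources for the whole life cycle
    MOLDABLE = "moldable"    # batch system may deviate from the request before start
    MALLEABLE = "malleable"  # batch system may expand/shrink at runtime
    EVOLVING = "evolving"    # application itself requests expand/shrink at runtime


@dataclass
class Job:
    name: str
    job_type: JobType
    requested_nodes: int
    allocated_nodes: int = 0
    running: bool = False


def may_resize(job: Job, initiator: str) -> bool:
    """Return True if `initiator` may change the job's allocation right now.

    `initiator` is "batch_system" or "application" (hypothetical labels).
    """
    if initiator not in ("batch_system", "application"):
        raise ValueError(f"unknown initiator: {initiator!r}")
    if not job.running:
        # Assumption: before start, only moldable jobs are adjusted,
        # and only by the batch system.
        return job.job_type is JobType.MOLDABLE and initiator == "batch_system"
    if job.job_type is JobType.MALLEABLE:
        # Runtime changes to malleable jobs are driven by the batch system.
        return initiator == "batch_system"
    if job.job_type is JobType.EVOLVING:
        # Runtime changes to evolving jobs happen only on application request.
        return initiator == "application"
    # Rigid and moldable jobs keep their allocation once running.
    return False
```

For example, `may_resize(Job("sim", JobType.EVOLVING, 8, 8, running=True), "application")` is true, while the same call with `"batch_system"` is false, which mirrors the distinction the abstract draws between malleable and evolving jobs.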

Similar resources

Integrated modeling and solving the resource allocation problem and task scheduling in the cloud computing environment

Cloud computing is considered to be a new service provider technology for users and businesses. However, the cloud environment is facing a number of challenges. Resource allocation in a way that is optimum for users and cloud providers is difficult because of lack of data sharing between them. On the other hand, job scheduling is a basic issue and at the same time a big challenge in reaching hi...

Stability Assessment Metamorphic Approach (SAMA) for Effective Scheduling based on Fault Tolerance in Computational Grid

Grid computing allows coordinated and controlled resource sharing and problem solving in multi-institutional, dynamic virtual organizations. Moreover, fault tolerance and task scheduling are important issues for large-scale computational grids because of the unreliable nature of grid resources. A commonly exploited technique for realizing fault tolerance is periodic checkpointing, which periodically ...

Data Replication-Based Scheduling in Cloud Computing Environment

High-performance computing and vast storage are two key factors required for executing data-intensive applications. In comparison with traditional distributed systems like data grid, cloud computing provides these factors in a more affordable, scalable and elastic platform. Furthermore, accessing data files is critical for performing such applications. Sometimes accessing data becomes...

An Agent Based Dynamic Resource Scheduling Model with FCFS-Job Grouping Strategy in Grid Computing

Grid computing is a group of clusters connected over high-speed networks that involves coordinating and sharing computational power, data storage and network resources operating across dynamic and geographically dispersed locations. Resource management and job scheduling are critical tasks in grid computing. Resource selection becomes challenging due to heterogeneity and dynamic availability of...

Grouping-Based Job Scheduling Model In Grid Computing

Grid computing is a high-performance computing environment for solving large-scale computational applications. It involves resource management, job scheduling, security problems, information management and so on. Job scheduling is a fundamental and important issue in achieving high performance in grid computing systems. However, it is a big challenge to design an efficient scheduler a...

Publication date: 2016